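2023 iThome Ironman Contest

DAY 18

AI & Data

Full-Stack LLM Application Development (Vector Databases, Hugging Face, OpenAI, Azure ML, LangChain, FastAPI and more), Part 18

Full-Stack LLM Application Development, Day 18: Storing Vector Data with Milvus

15th Ironman Contest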

大魔術熊貓工程師

2023-10-03 19:57:32


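Yesterday we covered the basics of Pinecone. Next we'll look at some open-source vector databases, starting today with Milvus.

Milvus is an open-source vector database built for vector search and a wide range of AI applications. The project was released in October 2019 under the Apache License 2.0 and is now a graduated project of the LF AI & Data Foundation. You can self-host it locally, and a managed cloud service is also available.

The current major version is Milvus 2.0. It is cloud-native and separates storage from compute by design, which is key to its elasticity and flexibility. The Milvus 2.0 architecture is also fully stateless, which further helps its adaptability and scalability.

On performance, Milvus claims it can search trillion-vector datasets with average latency measured in milliseconds.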

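Local installation

Download the Docker Compose file from https://github.com/milvus-io/milvus/releases/ . It also starts other services, such as etcd and MinIO. We won't dig into those for now, so just bring them all up together. That's how painless containers are 😂😂. If you still find the extras unnecessary, you can use the image packaged by Bitnami instead: https://hub.docker.com/r/bitnami/milvus/ .
Note that the official Docker Compose file also comes in a GPU version, but we're on a budget, so we'll stick with the standard one. Then run docker-compose up -d to start it locally.
Next, open http://localhost:9091/healthz to check that it started correctly. If it did, the page shows OK.
Then install the Python SDK with poetry add pymilvus.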
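If you would rather script the health check than open a browser, the following is a minimal sketch using only the Python standard library. The function name is_milvus_healthy and the two-second timeout are illustrative choices, not part of Milvus or pymilvus. It assumes the endpoint answers HTTP 200 with the body OK, as described above.

```python
import urllib.error
import urllib.request


def is_milvus_healthy(url="http://localhost:9091/healthz", timeout=2.0):
    """Return True if the health endpoint answers HTTP 200 with the body 'OK'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip()
            return response.status == 200 and body == "OK"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling is_milvus_healthy() right after docker-compose up -d may return False for a while, since the containers need some time to start, so a retry loop around it can be useful.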

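Using Milvus

Milvus works a bit differently from Pinecone: first we create a database, then a schema and a collection, and only then do we insert vector data.

Use the code below to create the database. After the first run you can comment it out.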

from pymilvus import (
    connections,
    db,
)

connection = connections.connect("default", host="localhost", port="19530")
database = db.create_database("lyrics")
print(db.list_database())
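Instead of commenting the creation code in and out, one option is to create the database only when it is missing. The sketch below is written against any object that exposes list_database(), create_database() and using_database(), which the pymilvus db module does. The helper name ensure_database is hypothetical. Note that it also switches the connection to that database, which the article's code does not do, so treat it as a variation rather than a drop-in replacement.

```python
def ensure_database(db_module, name):
    """Create the database only if it is missing, then switch to it.

    db_module must expose list_database(), create_database() and
    using_database(), as the pymilvus `db` module does.
    Returns True if the database was created by this call.
    """
    created = name not in db_module.list_database()
    if created:
        db_module.create_database(name)
    db_module.using_database(name)
    return created


# Intended usage against a running Milvus (not executed here):
#   from pymilvus import connections, db
#   connections.connect("default", host="localhost", port="19530")
#   ensure_database(db, "lyrics")
```

Because the helper only depends on those three methods, it can be exercised with a small fake object in unit tests without a running server.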

As before, bring in the OpenAI embedding helpers.

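import openai


def get_embedding(text, model_name):
    # `engine` is the Azure OpenAI deployment name (openai SDK < 1.0 style)
    response = openai.Embedding.create(
        input=text,
        engine=model_name
    )
    return response['data'][0]['embedding']


def prepare_embeddings(text_array, model_name):
    # One embedding request per text, returned in input order
    return [get_embedding(text, model_name) for text in text_array]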
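prepare_embeddings sends one request per text. As a possible variation, the sketch below groups texts into batches and sends each batch in a single call. The helper name embed_in_batches and the default batch_size of 16 are assumptions for illustration, and your Azure deployment may enforce a smaller limit on the input array. The create function is passed in as a parameter so the logic can be tested without calling the API.

```python
def embed_in_batches(texts, create_fn, model_name, batch_size=16):
    """Embed texts with one request per batch instead of one per text.

    create_fn is called as create_fn(input=[...], engine=model_name) and must
    return a mapping shaped like
    {'data': [{'index': i, 'embedding': [...]}, ...]}.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = create_fn(input=batch, engine=model_name)
        # Sort by 'index' so the output order matches the input order.
        ordered = sorted(response["data"], key=lambda item: item["index"])
        embeddings.extend(item["embedding"] for item in ordered)
    return embeddings


# Intended usage with the openai SDK < 1.0 (not executed here):
#   embedding_array = embed_in_batches(
#       text_array, openai.Embedding.create, EMBEDDING_MODEL_NAME)
```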

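Next we define a function that creates the schema and the collection. This feels much like defining columns in a traditional database, and it differs from the metadata approach we used with Pinecone yesterday. Pay attention to index_params: here we use an HNSW index with the COSINE metric, where M is the maximum degree of a node in the graph and efConstruction is the search scope used while the index is being built.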

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
)


def create_milvus_collection(collection_name):
    # Three fields: a manual integer primary key, the lyric text,
    # and the embedding vector
    fields = [
        FieldSchema(name='id', dtype=DataType.INT64,
                    description='Ids', is_primary=True, auto_id=False),
        FieldSchema(name='lyric', dtype=DataType.VARCHAR,
                    description='lyric texts', max_length=500),
        # 1536 is the output size of OpenAI's text-embedding-ada-002
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR,
                    description='Embedding vectors', dim=1536)
    ]
    schema = CollectionSchema(fields=fields, description='Lyrics collection')
    collection = Collection(name=collection_name, schema=schema)

    # HNSW graph index on the vector field, compared by cosine similarity
    index_params = {
        'index_type': 'HNSW',
        'metric_type': 'COSINE',
        'params': {'M': 16, 'efConstruction': 500}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection
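To make the COSINE metric in index_params more concrete, here is a small pure-Python sketch of cosine similarity. It only illustrates the formula, dot(a, b) / (|a| * |b|), and is not how Milvus computes it internally. Higher values mean the two vectors point in more similar directions, with 1.0 for identical directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

For example, two orthogonal vectors such as [1, 0] and [0, 1] give 0.0, while a vector compared with itself gives 1.0.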

The next function inserts the data.

def insert_to_milvus(collection, text_array, embedding_array):
    entities = [
        {"id": i, "lyric": text_array[i], "embedding": embedding_array[i]}
        for i in range(len(text_array))
    ]
    collection.insert(entities)
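Since the schema fixes the vector dimension at 1536 and the lyric length at 500, it can help to validate rows before sending them. The sketch below builds the same list of row dictionaries as insert_to_milvus and checks them first. The helper name build_entities is hypothetical, and the byte-based length check assumes that VARCHAR max_length is counted in bytes, as the Milvus documentation appears to describe. Under that assumption, a Chinese character takes three bytes in UTF-8.

```python
def build_entities(text_array, embedding_array, dim=1536, max_bytes=500):
    """Build row dicts for insertion, validating them against the schema limits."""
    if len(text_array) != len(embedding_array):
        raise ValueError("texts and embeddings must have the same length")
    entities = []
    for i, (text, embedding) in enumerate(zip(text_array, embedding_array)):
        size = len(text.encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"row {i}: lyric is {size} bytes, limit is {max_bytes}")
        if len(embedding) != dim:
            raise ValueError(f"row {i}: embedding has {len(embedding)} values, expected {dim}")
        entities.append({"id": i, "lyric": text, "embedding": embedding})
    return entities


# Intended usage (not executed here):
#   collection.insert(build_entities(text_array, embedding_array))
```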

With the data inserted, here is the search function!

def search_from_milvus(collection, query_embedding, k=1):
    search_params = {
        "metric_type": "COSINE"
    }

    results = collection.search(
        data=[query_embedding],  
        anns_field="embedding",  # The vector field to search against
        param=search_params,
        limit=k,  
        output_fields=['lyric']  # Extra fields to return with each hit
    )

    ret = []
    for hit in results[0]:
        row = []
        # Collect the ID, the similarity score, and the lyric
        row.extend([hit.id, hit.score, hit.entity.get('lyric')])
        ret.append(row)
    return ret
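search_from_milvus passes only metric_type, so the search-time settings of the HNSW index are left at their defaults. As a sketch of how the query-time search scope could be tuned, the helper below builds a param dictionary with an ef value. The name hnsw_search_params and the default of 64 are illustrative assumptions. It raises ef to at least k because the Milvus documentation gives the lower bound of ef as the number of results requested.

```python
def hnsw_search_params(k, ef=64, metric_type="COSINE"):
    """Build a search param dict for an HNSW index, keeping ef >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {"metric_type": metric_type, "params": {"ef": max(ef, k)}}


# Intended usage inside search_from_milvus (not executed here):
#   results = collection.search(
#       data=[query_embedding], anns_field="embedding",
#       param=hnsw_search_params(k), limit=k, output_fields=['lyric'])
```

A larger ef generally trades search speed for recall, so it is worth measuring both on your own data before settling on a value.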

Finally, here is the main program. Note that you must call collection.load() before you can search!

def main(connection, collection):
    COLLECTION_NAME = "Lyrics_collection"

    # Comment this block out once the collection has been created
    # (it needs `from pymilvus import utility`)
    # if utility.has_collection(COLLECTION_NAME):
    #     utility.drop_collection(COLLECTION_NAME)
    # collection = create_milvus_collection(COLLECTION_NAME)

    EMBEDDING_MODEL_NAME = "embedding-ada-002"
    openai.api_base = "https://japanopenai2023ironman.openai.azure.com/"
    openai.api_key = "yourkey"
    openai.api_type = "azure"
    openai.api_version = "2023-03-15-preview"

    text_array = ["我會披星戴月的想你，我會奮不顧身的前進，遠方煙火越來越唏噓，凝視前方身後的距離",
                "而我，在這座城市遺失了你，順便遺失了自己，以為荒唐到底會有捷徑。而我，在這座城市失去了你，輸給慾望高漲的自己，不是你，過分的感情"]
    embedding_array = prepare_embeddings(text_array, EMBEDDING_MODEL_NAME)

    insert_to_milvus(collection, text_array, embedding_array)

    query_text = "工程師寫城市"
    query_embedding = get_embedding(query_text, EMBEDDING_MODEL_NAME)
    results = search_from_milvus(collection, query_embedding, k=1)
    print(f"尋找 {query_text}:", results)


if __name__ == '__main__':
    connection = connections.connect("default", host="localhost", port="19530")
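    # Collection(name) without a schema only opens an existing collection,
    # so the collection must be created first on the very first run
    collection = Collection("Lyrics_collection")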
    collection.load()
    main(connection, collection)

The result looks like this. The scores are much the same as with the earlier vector databases.

尋找 工程師寫城市: [[1, 0.7930288314819336, '而我，在這座城市遺失了你，順便遺失了自己，以為荒唐到底會有捷徑。而我，在這座城市失去了你，輸給慾望高漲的自己，不是你，過分的感情']]
